Final Project: Baseball Over the Last 40 Years

Author

Samuel Harris

Research Question + Variables

  • Research Question: How has baseball changed in the past 40 years, and why has it changed?

  • I think this is interesting because baseball has taken a more statistical turn in the past 20 years, and I believe this should be reflected in the data. Teams started to value players based on different metrics, and I expect the metrics they came to value to have risen over time in league-wide team statistics.

  • The data come from Lahman’s Baseball Database, available at http://seanlahman.com/download-baseball-database/. I focus on its “Teams” table, which contains season-level team statistics such as home runs and strikeouts for each year from 1871 to 2022.

  • Here’s a glance at the summary statistics of the numeric variables I focus on. I chose these variables because they are fundamental to baseball. The data have been filtered to team seasons from 1980 onward, and counting statistics are expressed per game.

Code
library(gridExtra)
library(corrplot)
library(tidyverse)
library(estimatr)
library(skimr)
library(tidyr)

rm(list=ls())
b = read_csv("Teams.csv", show_col_types = FALSE)

# Add win percentage (wins divided by games played)
b = b %>% 
  mutate(winpct = W/G)

# Normalize the statistics used below to per-game values.
# Columns are selected by name so the step does not depend on column order.
b <- b %>%
  mutate(across(c(H, HR, SO, R), ~ round(.x / G, 2)))

# Keep seasons from 1980 on and only the columns of interest
b = b %>% 
  filter(yearID >= 1980) %>% 
  select(yearID, name, franchID, H, HR, SO, R, winpct)

b %>% 
  skim() %>% 
  yank('numeric')

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
yearID 0 1 2001.68 12.29 1980.00 1991.00 2002.00 2012.00 2022.00 ▇▆▇▇▇
H 0 1 8.85 0.55 6.50 8.48 8.85 9.22 10.40 ▁▁▇▇▂
HR 0 1 1.00 0.26 0.29 0.81 0.99 1.17 1.97 ▁▇▇▃▁
SO 0 1 6.67 1.24 3.61 5.74 6.52 7.50 10.13 ▁▇▇▅▁
R 0 1 4.53 0.53 3.10 4.16 4.50 4.87 6.23 ▁▆▇▃▁
winpct 0 1 0.50 0.07 0.27 0.45 0.50 0.56 0.72 ▁▅▇▆▁

Variables over Time

  • To see which variables have changed over time, I calculated the league-wide mean of each variable for every year and collected these yearly means in a data frame.
  • The outcome variables of interest over time are hits, home runs, and strikeouts.
Code
# Create new data set with means by year

summary_stats <- b %>%
  group_by(yearID) %>%
  summarise(across(c(H, HR, SO), mean))  # select columns by name, not position

knitr::kable(summary_stats, digits = 2, caption="League-wide Means by Year")
League-wide Means by Year
yearID H HR SO
1980 9.06 0.73 4.80
1981 8.67 0.64 4.75
1982 8.94 0.80 5.04
1983 8.88 0.78 5.15
1984 8.88 0.77 5.35
1985 8.74 0.86 5.34
1986 8.77 0.91 5.87
1987 9.00 1.06 5.96
1988 8.63 0.76 5.56
1989 8.62 0.73 5.62
1990 8.74 0.79 5.67
1991 8.69 0.80 5.80
1992 8.68 0.72 5.59
1993 9.05 0.89 5.80
1994 9.30 1.03 6.18
1995 9.17 1.01 6.30
1996 9.33 1.09 6.46
1997 9.15 1.02 6.60
1998 9.15 1.04 6.56
1999 9.33 1.14 6.41
2000 9.31 1.17 6.45
2001 9.03 1.12 6.67
2002 8.92 1.04 6.47
2003 9.06 1.07 6.34
2004 9.17 1.12 6.55
2005 9.05 1.03 6.30
2006 9.28 1.11 6.52
2007 9.25 1.02 6.62
2008 9.05 1.00 6.77
2009 8.96 1.04 6.91
2010 8.76 0.95 7.06
2011 8.70 0.94 7.10
2012 8.66 1.02 7.50
2013 8.66 0.96 7.55
2014 8.56 0.86 7.70
2015 8.67 1.01 7.71
2016 8.71 1.15 8.03
2017 8.69 1.26 8.25
2018 8.44 1.15 8.47
2019 8.65 1.40 8.82
2020 8.04 1.28 8.68
2021 8.13 1.22 8.68
2022 8.16 1.07 8.40
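
As a rough summary of the table above, the change between the first and last seasons can be expressed in percent. The sketch below hard-codes the 1980 and 2022 league-wide means from the table; the object names `first`, `last`, and `pct_change` are introduced here for illustration and are not part of the original analysis.

```r
# League-wide per-game means copied from the table above
first <- c(H = 9.06, HR = 0.73, SO = 4.80)  # 1980
last  <- c(H = 8.16, HR = 1.07, SO = 8.40)  # 2022

# Percent change from 1980 to 2022, rounded to one decimal place
pct_change <- round(100 * (last - first) / first, 1)
print(pct_change)
```

On these rounded values, strikeouts per game rose by about 75% and home runs by about 47%, while hits fell by about 10%. Endpoint comparisons are sensitive to single-season noise, so this is only a coarse complement to the trend analysis.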

Correlation Over Time

  • To visualize correlation between variables, I created a correlation plot for all variables.
Code
## Create Correlation Plot

corr_matrix <- cor(summary_stats)

corrplot(corr_matrix, type = "upper", order = "hclust", tl.col = "black",addCoef.col = "white", number.cex = 1, method='color')

  • I then created a table showing each variable’s correlation with year. Strikeouts were the most strongly correlated (r = 0.96), increasing steadily over the past 40 years.
Code
#### Create Dataframe
# Correlation of each yearly mean with year, with its p-value
result <- lapply(c("H", "HR", "SO"), function(v) {
  test <- cor.test(summary_stats[[v]], summary_stats$yearID)
  data.frame(Variable = v,
             Cor = round(unname(test$estimate), 2),
             Pval = round(test$p.value, 2))
}) %>% 
  bind_rows()

## Show Variables correlation over time
result = result %>% 
   arrange(desc(abs(Cor)))

knitr::kable(result, caption= "Correlation to Year Table")
Correlation to Year Table
Variable Cor Pval
SO 0.96 0.00
HR 0.73 0.00
H -0.41 0.01
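
A correlation says how tightly a series tracks time but not how fast it changes. As a minimal sketch, the block below fits a straight line to five strikeout means taken from the League-wide Means by Year table (one season per decade) to get an approximate change per year. The subset of years and the names `so_by_decade`, `trend`, `slope`, and `r` are choices made here for illustration; the full analysis would use all 43 seasons in `summary_stats`.

```r
# Strikeouts per game for one season per decade, copied from the yearly table
so_by_decade <- data.frame(
  yearID = c(1980, 1990, 2000, 2010, 2020),
  SO     = c(4.80, 5.67, 6.45, 7.06, 8.68)
)

# Simple linear trend: the yearID coefficient is the average change per season
trend <- lm(SO ~ yearID, data = so_by_decade)
slope <- unname(coef(trend)["yearID"])

# Correlation with year for the same five points
r <- cor(so_by_decade$SO, so_by_decade$yearID)

print(round(c(slope_per_year = slope, correlation = r), 3))
```

On these five points the slope is about 0.09 strikeouts per game per season, which puts the high correlation into the units of the statistic itself.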

Scatter Plots

  • These scatter plots visualize each variable’s correlation with time.

  • Again, strikeouts have clearly increased steadily over time. But why? Aren’t strikeouts a reason teams lose? Why would teams accept striking out more? This is the question I focus on next.

Code
###Scatter plots 
df = summary_stats

# Convert the data frame to long format
df_long <- df %>%
  pivot_longer(-yearID, names_to = "variable", values_to = "value")

# Create a scatter plot for each variable
df_long %>%
  ggplot(aes(x = yearID, y = value)) + 
  geom_smooth(method = "lm")+
  geom_point() + 
  labs(x = "Time", y ="Statistic",
       caption = "Source: Lahman’s Baseball Database - \"Teams\"") +
  ggtitle("Per Game Baseball Stats Over Time") +
  facet_wrap(~ variable, scales = "free_y") +
   theme_bw() +
  theme(plot.caption = element_text(size =8, hjust = 0, vjust= -2.1, face="italic")) +
  theme(plot.title = element_text(hjust = 0.5, vjust = 2))

Strikeouts

  • New question: What variables predict strikeouts?

  • To see which variables predict strikeouts, I fit a linear regression on the team-season data. A model with yearID, hits, and home runs as predictors of strikeouts had a relatively high R² of 0.79.

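  • The model shows a positive coefficient on home runs (1.20) and a smaller negative coefficient on hits (−0.70): holding the other predictors fixed, teams that hit more home runs per game strike out more, while teams with more hits per game strike out less.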

    Code
    library(teamcolors)
    library(modelsummary)
    library(Lahman)
    library(gganimate)
    library(gapminder)
    library(gifski)
    library(plotly)
    # What are strikeouts correlated with?
    
    b1 <- b
    
    
    # Cleaning Names
    b1 = b1 %>% 
    mutate(name = ifelse(name == "Los Angeles Angels of Anaheim", "Los Angeles Angels", name),
           name = ifelse(name == "California Angels", "Los Angeles Angels", name),
           name = ifelse(name == "Anaheim Angels", "Los Angeles Angels", name),
           name = ifelse(name == "Tampa Bay Devil Rays", "Tampa Bay Rays", name),
           name = ifelse(name == "Cleveland Guardians", "Cleveland Indians", name),
           name = ifelse(name == "Montreal Expos", "Washington Nationals", name),
           name = ifelse(name == "Florida Marlins", "Miami Marlins", name))
    
    
    modelsummary(lm(SO ~ yearID + H + HR, data = b1))
    Term (1)
    (Intercept) −118.488 (3.328)
    yearID 0.065 (0.002)
    H −0.696 (0.032)
    HR 1.196 (0.077)
    Num.Obs. 1228
    R2 0.792
    R2 Adj. 0.791
    AIC 2097.3
    BIC 2122.8
    Log.Lik. −1043.636
    RMSE 0.57
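
The raw coefficients are measured per one-unit change, but hits and home runs vary on different scales, so their sizes are not directly comparable. One rough way to compare them is to multiply each coefficient by the standard deviation of its predictor. The sketch below does this with the rounded coefficients from the model output and the standard deviations from the summary statistics (0.55 for H, 0.26 for HR); the names `coefs`, `sds`, and `per_sd` are introduced here for illustration.

```r
# Rounded coefficients from the strikeout model output above
coefs <- c(H = -0.696, HR = 1.196)

# Standard deviations of the per-game predictors from the summary statistics
sds <- c(H = 0.55, HR = 0.26)

# Predicted change in strikeouts per game for a one-SD increase in each predictor
per_sd <- round(coefs * sds, 2)
print(per_sd)
```

By this measure the two associations are of similar size (about −0.38 for hits and 0.31 for home runs), so the larger raw coefficient on home runs partly reflects its narrower spread. This is a back-of-the-envelope comparison from rounded values, not a refit of the model.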

Visualizations

  • These interactive plots show strikeouts against home runs and strikeouts against hits. Win percentage is indicated by point size, team by color, and a slider selects the year.
  • The first plot shows the positive relationship between home runs and strikeouts.
  • The second plot shows the negative relationship between hits and strikeouts.
Code
# Recently, as strikeouts have increased, hits have decreased while home runs have increased.
# Let's visualize this.

g1 = b1  %>% 
  ggplot(aes(x = SO, y = HR, color = name, fill = name)) +
  geom_point(aes(frame = yearID, size = winpct))+
  scale_fill_teams(guide = "none") +      # guide = FALSE is deprecated in recent ggplot2
  scale_color_teams(2, guide = "none") +
  labs(x = "Strikeouts", 
       y = "Homeruns", 
       title  = "Strikeouts vs Homeruns by Team since 1980 (Per Game)",
       caption = "Source: Lahman’s Baseball Database - \"Teams\"") +
  theme_bw() +
    theme(legend.key.height= unit(1.2, 'cm'),
        legend.key.width= unit(1, 'cm'))

ggplotly(g1) %>% 
  layout(margin = list(l = 50, r = 50, b = 100, t = 50),
         annotations = list(x = .4, y = -.24, text = "Source: Lahman’s Baseball Database - \"Teams\"", xref='paper', yref='paper', showarrow = F,  xanchor='right', yanchor='auto', xshift=0, yshift=0,font = list(size = 7)))
Code
g2 = b1  %>% 
  ggplot(aes(x = SO, y = H, color = name, fill = name)) +
  geom_point(aes(frame = yearID, size = winpct))+
  scale_fill_teams(guide = "none") +      # guide = FALSE is deprecated in recent ggplot2
  scale_color_teams(2, guide = "none") +
  labs(x = "Strikeouts", 
       y = "Hits", 
       title  = "Strikeouts vs Hits by Team since 1980 (Per Game)",
       caption = "Source: Lahman’s Baseball Database - \"Teams\"") +
  theme_bw() +
    theme(legend.key.height= unit(1.2, 'cm'),
        legend.key.width= unit(1, 'cm')) 

ggplotly(g2)%>% 
  layout(margin = list(l = 50, r = 50, b = 100, t = 50),
         annotations = list(x = .4, y = -.24, text = "Source: Lahman’s Baseball Database - \"Teams\"", xref='paper', yref='paper', showarrow = F,  xanchor='right', yanchor='auto', xshift=0, yshift=0,font = list(size = 7)))

Runs

  • So, what’s more valuable for scoring runs: hits or home runs? A model with hits and home runs as predictors of runs had a relatively high R² of 0.82.

  • The model shows positive coefficients on both predictors: about 1.18 more runs per game for each additional home run per game, and about 0.58 more runs per game for each additional hit per game. The next tab visualizes the relationships.

Code
modelsummary(lm(R ~ HR + H, data = b1))
Term (1)
(Intercept) −1.762 (0.102)
HR 1.178 (0.025)
H 0.578 (0.012)
Num.Obs. 1228
R2 0.822
R2 Adj. 0.821
AIC −187.9
BIC −167.4
Log.Lik. 97.949
RMSE 0.22
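
To make the fitted equation concrete, the sketch below plugs values into it by hand. It uses the rounded coefficients from the output above and the sample means from the summary statistics (8.85 hits and 1.00 home runs per game); `predict_runs` is a helper defined here for illustration, not a function from the original analysis.

```r
# Rounded coefficients from the runs model output above
intercept <- -1.762
b_hr <- 1.178
b_h  <- 0.578

# Hypothetical helper: predicted runs per game from home runs and hits per game
predict_runs <- function(hr, h) intercept + b_hr * hr + b_h * h

# Prediction at the sample means of HR and H
at_means <- predict_runs(hr = 1.00, h = 8.85)

# Effect of 0.1 more home runs per game, holding hits per game fixed
delta_hr <- predict_runs(hr = 1.10, h = 8.85) - at_means

print(round(c(at_means = at_means, delta_hr = delta_hr), 2))
```

The prediction at the means, about 4.53 runs per game, matches the sample mean of R in the summary statistics, as expected for a least-squares fit. Note that in the Lahman Teams table H counts all hits, including home runs, so holding H fixed while raising HR amounts to turning other hits into home runs.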

Visualizations

  • The first plot shows a clear linear relationship between home runs and runs.
Code
## Looking at the relationship between Hits and Runs and Homeruns and Runs

g3 = b1  %>% 
  ggplot(aes(x = HR, y = R, fill = winpct)) +
  geom_point(shape =21, size =1.8)+
  labs(x = "Homeruns per Game", 
       y = "Runs per Game", 
       title  = "Homeruns vs Runs by Team since 1980",
       fill = 'Win Percentage',
       caption = "Source: Lahman’s Baseball Database - \"Teams\"") +
  theme_bw() +
    theme(legend.key.height= unit(1.2, 'cm'),
        legend.key.width= unit(1, 'cm')) +
    theme(plot.caption = element_text(size =8, hjust = 0, vjust= -2.1, face="italic"))+ scale_fill_gradient(low = "yellow", high = "blue") +
  geom_smooth(color="red") 

g3

  • However, the second plot shows a different relationship between hits and runs. Runs stay roughly constant as hits increase until hits reach about 8.5 per game; above that, the relationship looks more linear.

  • So, this analysis suggests that a team must reach a certain number of hits per game before additional hits translate into more runs, whereas more home runs are associated with more runs across the whole range.

    Code
    g4 = b1  %>% 
      ggplot(aes(x = H, y = R, fill = winpct)) +
      geom_point(shape =21, size =1.8)+
      labs(x = "Hits per Game", 
           y = "Runs per Game", 
           title  = "Hits vs Runs by Team since 1980",
           fill = 'Win Percentage',
           caption = "Source: Lahman’s Baseball Database - \"Teams\"") +
      theme_bw() +
        theme(legend.key.height= unit(1.2, 'cm'),
            legend.key.width= unit(1, 'cm')) +
        theme(plot.caption = element_text(size =8, hjust = 0, vjust= -2.1, face="italic"))+ scale_fill_gradient(low = "yellow", high = "blue") +
        geom_smooth(color="red")
    
    g4
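
One way to check the apparent threshold near 8.5 hits per game is a piecewise-linear (hinge) regression, which allows the slope to change at a chosen knot. The sketch below runs on simulated data, not the Lahman data: the data-generating values, the knot at 8.5, and the names `sim`, `hinge`, and `fit` are all assumptions made for illustration. With the real data, `b1` would take the place of `sim`.

```r
set.seed(1)

# Simulated team seasons: runs are flat below 8.5 hits per game and rise above it
sim <- data.frame(H = runif(300, min = 6.5, max = 10.4))
sim$R <- 3.8 + 0.9 * pmax(sim$H - 8.5, 0) + rnorm(300, sd = 0.3)

# Hinge term: zero below the knot, (H - 8.5) above it
sim$hinge <- pmax(sim$H - 8.5, 0)

# The H coefficient is the slope below the knot;
# the hinge coefficient is the additional slope above it
fit <- lm(R ~ H + hinge, data = sim)
print(round(coef(fit), 2))
```

A hinge coefficient clearly different from zero would support a change in slope at the knot; a coefficient near zero would suggest that a single straight line is adequate.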

Conclusions

  • So, the final conclusion is that strikeouts have increased over time because teams have come to value home runs more than hits. Teams are willing to strike out more if it means hitting more home runs, which translate directly into runs.

  • These final plots visually show the increase of strikeouts and home runs over time. There is a slider to view a particular team.

    Code
    #So, yes. Teams have realized the value of home runs is greater than general hits like singles.
    #So, over time, teams have built teams that have hit more home runs while also striking out more.
    
    g5 = b1  %>% 
      ggplot(aes(x = yearID, y = HR, color= name)) +
      geom_line(aes(frame = franchID))+
      labs(x = "Year", 
           y = "Homeruns per Game", 
           title  = "Homeruns by Team since 1980",
           caption = "Source: Lahman’s Baseball Database - \"Teams\"") +
      scale_color_teams() +
      theme_bw() +
        theme(legend.key.height= unit(1.2, 'cm'),
            legend.key.width= unit(1, 'cm')) +
        theme(plot.caption = element_text(size =8, hjust = 0, vjust= -2.1, face="italic"))
    
    ggplotly(g5)%>% 
      layout(margin = list(l = 50, r = 50, b = 100, t = 50),
             annotations = list(x = .4, y = -.24, text = "Source: Lahman’s Baseball Database - \"Teams\"", xref='paper', yref='paper', showarrow = F,  xanchor='right', yanchor='auto', xshift=0, yshift=0,font = list(size = 7)))
    Code
    g6 = b1  %>% 
      ggplot(aes(x = yearID, y = SO, color= name)) +
      geom_line(aes(frame = franchID))+
      labs(x = "Year", 
           y = "Strikeouts per Game", 
           title  = "Strikeouts by Team since 1980",
           caption = "Source: Lahman’s Baseball Database - \"Teams\"") +
      scale_color_teams() +
      theme_bw() +
        theme(legend.key.height= unit(1.2, 'cm'),
            legend.key.width= unit(1, 'cm')) 
    
    ggplotly(g6)%>% 
      layout(margin = list(l = 50, r = 50, b = 100, t = 50),
             annotations = list(x = .4, y = -.24, text = "Source: Lahman’s Baseball Database - \"Teams\"", xref='paper', yref='paper', showarrow = F,  xanchor='right', yanchor='auto', xshift=0, yshift=0,font = list(size = 7)))